Multi-Scale Context Aggregation by Dilated Convolutions

(ICLR 2016) Fisher Yu et al.

Research Objective

The goal is to compute a discrete or continuous label for each pixel in the image.

Semantic segmentation calls for classifying each pixel into one of a given set of categories.

Problem Statement

Convolutional network architectures repurposed for dense prediction were originally developed for image classification.

Therefore, it is worth exploring which aspects of the repurposed networks are truly necessary and which reduce accuracy when operated densely.

Modules designed specifically for dense prediction are also needed to improve accuracy further.

Method(s)

dilated convolution

Let $F$ be a discrete 2D map (e.g., an image) and $k$ a convolution filter. The $l$-dilated convolution is defined as
$$
(F*_{l}k)(p)=\sum_{s+lt=p}F(s)\,k(t)
$$
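As a minimal sketch, the definition above can be implemented directly in NumPy for a single-channel 2D map; the function name and the "valid"-style (unpadded) output are illustration choices, not from the paper:

```python
import numpy as np

def dilated_conv2d(F, k, l):
    """Naive l-dilated convolution (F *_l k)(p) = sum_{s + l*t = p} F(s) k(t).
    F: 2D input map, k: 2D filter, l: dilation factor.
    Output is 'valid'-sized (no padding)."""
    kf = k[::-1, ::-1]                              # flip k so the constraint s + l*t = p is honored
    kh, kw = kf.shape
    H, W = F.shape
    eh, ew = (kh - 1) * l + 1, (kw - 1) * l + 1     # effective filter extent with taps spread l apart
    out = np.zeros((H - eh + 1, W - ew + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            for ty in range(kh):
                for tx in range(kw):
                    out[y, x] += F[y + l * ty, x + l * tx] * kf[ty, tx]
    return out
```

With $l=1$ this reduces to ordinary convolution; in deep-learning frameworks the same operation (without the kernel flip, i.e., cross-correlation) is available directly, e.g., `nn.Conv2d(..., dilation=l)` in PyTorch.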

Characteristic

Dilated convolutions support exponentially expanding receptive fields without losing resolution or coverage.

Consider applying $3\times3$ filters with exponentially increasing dilation:
$$
F_{i+1}=F_i*_{2^i}k_i
$$
It is easy to see that the size of the receptive field of each element in $F_{i+1}$ is $(2^{i+2}-1)\times(2^{i+2}-1)$.
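A quick numerical check of this growth, assuming $3\times3$ filters with dilation $2^i$ at layer $i$ (a one-off script, not from the paper):

```python
# Receptive-field growth for stacked 3x3 filters with dilation 2^i.
# Each such layer widens the receptive field by (kernel_size - 1) * dilation = 2 * 2^i per axis.
rf = 1
for i in range(5):
    d = 2 ** i
    rf += 2 * d
    assert rf == 2 ** (i + 2) - 1          # matches (2^{i+2} - 1) x (2^{i+2} - 1)
    print(f"F_{i+1}: dilation {d}, receptive field {rf}x{rf}")
```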

multi-scale context aggregation module

The module takes $C$ feature maps as input and produces $C$ feature maps as output. The input and output have the same form, thus the module can be plugged into existing dense prediction architectures.

The basic context module has $7$ layers that apply $3\times3$ convolutions with different dilation factors. The dilations are $1, 1, 2, 4, 8, 16,$ and $1$. Each of these convolutions is followed by a pointwise truncation $\max(\cdot, 0)$. A final layer performs $1\times1\times C$ convolutions and produces the output of the module.

Note that the frontend module that provides the input to the context network in our experiments produces feature maps at $64\times64$ resolution. We therefore stop the exponential expansion of the receptive field after layer 6.
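A PyTorch sketch of the basic context module under these assumptions ($C$ channels throughout, padding equal to the dilation to preserve the $64\times64$ resolution; the paper's exact padding scheme may differ):

```python
import torch.nn as nn

def basic_context_module(C):
    """Seven 3x3 layers with dilations 1, 1, 2, 4, 8, 16, 1, each followed by
    the pointwise truncation max(., 0) (ReLU), then a final 1x1xC convolution.
    Input and output are both C feature maps, so the module can be plugged
    into an existing dense prediction architecture."""
    layers = []
    for d in (1, 1, 2, 4, 8, 16, 1):
        layers += [nn.Conv2d(C, C, kernel_size=3, dilation=d, padding=d),
                   nn.ReLU(inplace=True)]
    layers.append(nn.Conv2d(C, C, kernel_size=1))   # final 1x1 output layer
    return nn.Sequential(*layers)

# e.g. module = basic_context_module(C=21) for the 21 Pascal VOC categories
```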

identity initialization

The authors found that random initialization schemes were not effective for the context module, so a form of identity initialization was used:
$$
k^b(t,a)=1_{[t=0]}1_{[a=b]}
$$
This initialization sets all filters such that each layer simply passes its input directly to the next layer.
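A sketch of this initialization for a square `nn.Conv2d` layer (it assumes `in_channels == out_channels`, as in the basic module); PyTorch's built-in `nn.init.dirac_` produces the same pattern:

```python
import torch
import torch.nn as nn

def identity_init(conv: nn.Conv2d):
    """Set k^b(t, a) = 1_[t = 0] * 1_[a = b]: only the center tap of the
    filter connecting input channel a to output channel b = a is 1, so
    the layer initially passes its input through unchanged."""
    with torch.no_grad():
        conv.weight.zero_()
        kh, kw = conv.kernel_size
        for b in range(conv.out_channels):
            conv.weight[b, b, kh // 2, kw // 2] = 1.0
        if conv.bias is not None:
            conv.bias.zero_()
```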

Conclusion

The dilated convolution operator is particularly suited to dense prediction due to its ability to expand the receptive field without losing resolution or coverage.

The accuracy of existing convolutional networks for semantic segmentation can be increased by removing vestigial components that had been developed for image classification.